In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.cluster.hierarchy as shc

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans

from sklearn.metrics import confusion_matrix
from sklearn.metrics import silhouette_score

from sklearn import datasets

%matplotlib inline
pd.set_option("display.max_columns", None)

# Lab 22 - Determining the number of clusters

We will look at two methods for determining the number of clusters. 

## Inertia and the elbow method
The first method assumes each cluster has a center, as in k-means clustering. It uses the inertia: the sum of the squared distances from each sample to its closest cluster center.
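
As a small worked example of this definition, the sketch below computes the inertia by hand in plain Python. The toy points, the centers, and the helper name `inertia` are made up for illustration and are not part of the lab data.

```python
from math import dist

def inertia(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

# Toy data (hypothetical): four points and two hand-picked centers.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centers = [(0.0, 1.0), (4.0, 1.0)]

# Every point is at distance 1 from its closest center, so the inertia is 4 * 1**2.
toy_inertia = inertia(points, centers)
print(toy_inertia)
```

With a fitted scikit-learn k-means model, the centers come from the algorithm rather than being chosen by hand.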

We'll load the iris dataset, as in Lab 20.

In [None]:
iris_dict = datasets.load_iris()

iris = pd.DataFrame(iris_dict.data, columns = iris_dict.feature_names)
iris.head()

Scale the data columns to be between 0 and 1.

Use k-means with k = 3 to compute the clusters. 

We can compute the sum of the squared distances of the samples to their closest cluster center as follows (`kmeans` should be the variable holding the fitted k-means model).

In [None]:
kmeans.inertia_

To find the best k value, we make a loop to compute the inertia for each k, storing the result in a list.

In [None]:
inertia_list = []
for k in range(1, 11):
    # Fit k-means with k clusters and record the resulting inertia.
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans_clusters = kmeans.fit_predict(iris_scaled)
    inertia_list.append(kmeans.inertia_)

Plot the values in `inertia_list`. You can use `range(1,11)` as the x values.

The elbow method tells us to look for the point where the curve stops dropping steeply and flattens into a roughly straight line. The value of k at that point is the suggested number of clusters.
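
One rough way to see where the curve straightens is to look at how much the inertia drops each time k increases by 1. The sketch below uses made-up inertia values (they are not results from the iris data), and the names `example_inertias` and `drops` are only for illustration.

```python
# Hypothetical inertia values for k = 1, ..., 6 (illustrative only, not real results).
example_inertias = [100, 40, 15, 12, 10, 9]

# Drop in inertia when going from k to k + 1 clusters.
drops = [example_inertias[i] - example_inertias[i + 1]
         for i in range(len(example_inertias) - 1)]

for k, drop in enumerate(drops, start=1):
    print(f"k = {k} -> {k + 1}: inertia drops by {drop}")
```

In this made-up example the drops shrink sharply once k reaches 3, so the elbow would be at about k = 3. Reading the elbow off a real plot is still a judgment call.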

We'll try this approach to determine the cluster number for the labor market data. Let's load in and clean the labor market data from the previous labs.

In [None]:
# skipfooter is only supported by the Python parser engine, so request it explicitly.
labor = pd.read_csv("../data/Feb2019_labor_market_majors.csv", skiprows=13,
                    skipfooter=3, index_col="Major", engine="python")
labor["Median Wage Early Career"] = labor["Median Wage Early Career"].str.replace(",","").astype(float)
labor["Median Wage Mid-Career"] = labor["Median Wage Mid-Career"].str.replace(",","").astype(float)

In [None]:
labor.head()

Create a new dataframe with the scaled data.

Run k-means clustering on the scaled data with k = 4.

Compute the inertia for 4 clusters.

What is the inertia if there is only 1 cluster? What is the inertia if every data point is its own cluster?

Compute the inertia for all values of k between 1 and 10 using a loop.

Now, plot the inertias as a line graph.

Where do you think the elbow is for this graph?

## Silhouette Score

Instead of computing the inertia, which requires cluster centers, we can compute the silhouette score, which needs only the cluster labels and the distances between points.

First, the Silhouette Coefficient is calculated for each data point. If $a$ is the mean distance from that point to all other points in its own cluster, and $b$ is the mean distance from that point to all points in the nearest cluster that the point is not part of, then the Silhouette Coefficient for the data point is
$$\frac{b - a}{\max\{a,b\}}$$

The Silhouette Score is the mean silhouette coefficient for all data points.
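
To make the definition concrete, here is a minimal sketch that computes the coefficients by hand in plain Python on a toy example. The points, labels, and the helper name `silhouette_coefficients` are hypothetical. The sketch assumes Euclidean distance and at least two clusters, and it assigns a coefficient of 0 to a point that is alone in its cluster.

```python
from math import dist

def silhouette_coefficients(points, labels):
    """Return (b - a) / max(a, b) for each point, using Euclidean distance."""
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    coeffs = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumption: a point alone in its cluster gets coefficient 0.
            coeffs.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        coeffs.append((b - a) / max(a, b))
    return coeffs

# Toy data (hypothetical): two well-separated clusters on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [0, 0, 1, 1]

coeffs = silhouette_coefficients(points, labels)
score = sum(coeffs) / len(coeffs)  # the silhouette score is the mean coefficient
print(coeffs, score)
```

For the first point, $a = 1$ and $b = (5 + 6)/2 = 5.5$, so its coefficient is $4.5/5.5 \approx 0.82$. Coefficients near 1 mean a point sits much closer to its own cluster than to the nearest other cluster.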

Again, compute the k-means clusters for the iris data set with k = 3.

We can compute the silhouette score as follows.

In [None]:
silhouette_score(iris_scaled,kmeans_clusters)

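We can find the value of k giving the highest (best) silhouette score by using a loop to try different values of k, similarly to the elbow method. Note that the silhouette score is only defined when there are at least 2 clusters, so start the loop at k = 2. Try doing this below.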

Do you get a similar answer as with the elbow method?

Now try using the silhouette score to find the best value of k for the labor data.

How does the k giving the best silhouette score compare to the elbow method?

## Starbucks drinks dataset

Try both the elbow method and the silhouette score to compute the optimum number of clusters for Starbucks drinks, based on their nutritional information. The original dataset is from Kaggle [here]